Abstract

Background: Deciphering protein-protein interactions (PPIs) at the domain level provides valuable
information about the binding mechanisms and functional roles of interacting proteins. The 3D structures
of protein complexes are a reliable source of domain-domain interactions (DDIs), but the number of known
structures is very limited. Several resources of computationally predicted DDIs have been generated, but
they are scattered across various places and their prediction performance is inconsistent. A well-organized
PPI and DDI analysis system that integrates these data with a fair scoring system is necessary.

Method: We integrated three structure-based DDI datasets and twenty computationally predicted DDI datasets
and constructed an interaction analysis system, named IDDI, which enables users to browse protein and
domain interactions together with their relationships. To integrate heterogeneous DDI information, we
introduce a novel scoring scheme that determines the reliability of each DDI by considering the prediction
score of the DDI, the confidence level of each prediction method and the independence between predicted
datasets. In addition, we connected this DDI information to comprehensive PPI information and developed a
unified interface for exploring interaction networks at both the protein and domain levels.

Result: IDDI provides 204,705 DDIs among 7,351 Pfam domains in the current version. The total number of
DDIs is eight times that of previous studies. Owing to this increase, 50.4% of PPIs could be correlated
with DDIs, more than twice the proportion of previous resources. The newly designed scoring scheme also
outperformed the previous system in accuracy. The user interface of IDDI supports interactive,
interconnected investigation of the proteins and domains involved in interactions. A specific example is
presented to show how efficiently the system yields comprehensive information on a target protein with
its PPI and DDI relationships. IDDI is freely available at http://pcode.kaist.ac.kr/iddi/.



Background 

Protein interactions, including binary PPIs and co-complexes, regulate biological processes and
biochemical reactions. Discovering protein interactions provides a detailed interpretation of the
cellular mechanisms of biological functions. Therefore, the identification of protein interactions is
a critical issue for biology researchers. Recently, massive amounts of protein interaction data have
become available owing to advances in large-scale screening techniques such as yeast two-hybrid and
affinity purification followed by mass spectrometry. Much protein interaction data verified by different
experimental methods is publicly available. However, although the increased data can give a landscape
of the protein interactome, they are not very informative about detailed binding mechanisms, and the
high false positive rate of the data is a major hurdle to interpreting the interactome [1].

Investigating protein interactions at the domain level can overcome these limitations. Proteins consist
of one or multiple domains, which are regarded as the functional units of proteins. In most cases,
domain-domain interactions (DDIs) are crucial clues to protein interactions. Therefore, DDIs can be key
supporting evidence for protein interaction mechanisms.

DDIs were first identified from the 3-dimensional (3D) structures of protein complexes in the Protein
Data Bank [2]. 3DID [3], iPfam [4] and PInS [5] extract DDIs from the binding regions in known 3D
structures. However, these datasets cover only a small proportion of DDIs because few 3D structures are
available. DDIs obtained from 3D structures cover less than 20% of the PPIs in Escherichia coli,
Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens [6]. To
complement these DDIs, various computational methods have been proposed to predict DDIs in recent
years [7-25]. However, it is cumbersome for individual researchers to gather and integrate the predicted
datasets, because each method has a different reliability level and the reliability of each dataset must
be analyzed further. Therefore, it is necessary to build an integrated system that combines all DDIs
under a unified reliability scoring scheme.

Up to now, two combined DDI databases, DOMINE [26] and UniDomInt [27], have been published. DOMINE
combined two 3D structure-based DDI datasets and thirteen predicted DDI datasets. The confidence level
of each predicted DDI in DOMINE is classified as High, Middle or Low based on the prediction overlap
indexes (POIs) of the predicted DDI dataset. UniDomInt, on the other hand, merged two 3D structure-based
DDI datasets and eight predicted DDI datasets. UniDomInt provides numerical reliability scores for
predicted DDIs by comparing the accuracy of the predicted datasets. Although DOMINE and UniDomInt
provide a large number of DDIs and compare the reliabilities of predicted DDIs in a unified format, some
of their datasets are outdated and the number of datasets they include is far below the number currently
published. They also ignore the scores assigned by each prediction method, so it is impossible to compare
the reliabilities of DDIs predicted in the same dataset. In addition, DOMINE and UniDomInt do not
provide information on the PPIs mediated by DDIs.

In this paper, we propose an integrated analysis system for DDIs and their related protein interactions,
called IDDI. We first combined three 3D structure-based DDI datasets and twenty predicted DDI datasets.
To estimate the reliability of predicted DDIs, we developed a novel scoring scheme that considers the
individual accuracy of each dataset, the independence among the datasets and the internal prediction
scores of the DDIs assigned by each method. The total number of DDIs is increased significantly compared
with previous comprehensive DDI databases, and the novel reliability scoring scheme achieved outstanding
performance in ranking highly reliable DDIs. Furthermore, we joined our new DDI database with a
comprehensive PPI database, ComBiCom [28], and constructed a unified analysis system with an interface
for protein interaction network analysis that enables the protein and domain interaction mechanisms to
be explored together.

Methods 

Data sources 

To construct a new comprehensive DDI database, we merged three 3D structure-based DDI datasets and
twenty predicted DDI datasets based on the Pfam identifier. Since some datasets, including P-value,
HiMAP, DomainGA and Top-down, use SCOP or InterPro identifiers, we converted the SCOP domains to Pfam
using SGD (http://www.yeastgenome.org) and the InterPro domains to Pfam using a mapping table from the
InterPro website (http://www.ebi.ac.uk/interpro). Although the other datasets use Pfam identifiers, they
are based on different versions of Pfam, so the same interaction may be represented differently because
some domains were changed or eliminated as Pfam was updated. We therefore unified all datasets to the
same release, Pfam-A 24.0. All domains that are not available in Pfam 24.0 were discarded. All combined
DDI datasets, with their numbers of DDIs and domains, are listed in Table 1.

Assessment of the reliability score for the predicted DDI 

Each predicted DDI in our new database is evaluated by a reliability score. We considered three factors
that affect reliability: i) the confidence level of each predicted dataset, ii) the independence of the
dataset and iii) the local prediction score of the DDI assigned by each dataset.

Confidence score 

Each predicted dataset has a different confidence level. Predicted DDIs are more reliable when they are
found in more accurate datasets.

To estimate the confidence levels, we used a weighted overlap method, which measures the similarity
between two datasets [27]. The weighted overlap (Wo) score between each predicted dataset and a
gold-standard positive (GSP) set serves as a criterion of the confidence level. To prevent errors due to
differences in the domains covered by two datasets, the weighted overlap method uses only DDIs whose
interacting domains are found in both datasets. For two DDI datasets a and b, the Wo score is defined as:

\[ Wo_{a,b} = \frac{2\,|I_a \cap I_b|}{|I_{a \to b}| + |I_{b \to a}|} \]

where I is a set of DDIs, $I_{a \to b}$ is the subset of $I_a$ whose interacting domains belong to both
datasets a and b and, likewise, $I_{b \to a}$ is the subset of $I_b$ whose interacting domains are found
in both datasets.

Table 1 Statistics and confidence scores of DDI datasets in IDDI

DDI Data [Ref.]       No. of Domains    No. of DDIs    Confidence score
Predicted Datasets
InterDom [14]         5,546             144,793        0.1476
[illegible]           [illegible]       [illegible]    [illegible]
KGIDDI [16]           1,559             5,646          0.0513
LLZ [17]              1,948             5,737          0.0915
ME [18]               1,226             2,373          0.5929
PE [19]               1,225             2,856          0.2348
P-value [20]          398               596            0.1047
RCDP [21]             484               960            0.2082
RDFF [22]             616               2,413          0.0993
Top-down [23]         4,303             22,221         0.3462
TW [24]               165               170            0.4254


The GSP set was generated from the 3D structure-based DDIs extracted from 3DID, iPfam and PInS. This
GSP set contains 6,768 verified DDIs in total. With the GSP set, the confidence score C of a predicted
dataset d is defined as:

\[ C_d = Wo_{d,GSP} \]
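
As an illustration of the weighted overlap and the confidence score defined above, the following is a minimal Python sketch, not the authors' implementation. It assumes that each DDI is represented as a frozenset of Pfam accessions (so that a homotypic interaction is a one-element set); the function names and the toy accessions are hypothetical.

```python
from itertools import chain


def domains_of(ddis):
    """Return the set of all domains that appear in a collection of DDIs."""
    return set(chain.from_iterable(ddis))


def weighted_overlap(ddis_a, ddis_b):
    """Weighted overlap Wo between two DDI sets (sets of frozensets)."""
    shared_domains = domains_of(ddis_a) & domains_of(ddis_b)
    # Keep only DDIs whose interacting domains occur in both datasets.
    a_restricted = {ddi for ddi in ddis_a if ddi <= shared_domains}
    b_restricted = {ddi for ddi in ddis_b if ddi <= shared_domains}
    denominator = len(a_restricted) + len(b_restricted)
    if denominator == 0:
        return 0.0  # no common domains, hence no measurable overlap
    return 2 * len(ddis_a & ddis_b) / denominator


def confidence_score(predicted_ddis, gsp_ddis):
    """Confidence score C_d: weighted overlap of dataset d with the GSP set."""
    return weighted_overlap(predicted_ddis, gsp_ddis)


# Toy data with made-up accessions, for illustration only.
predicted = {frozenset(("PF_A", "PF_B")), frozenset(("PF_A", "PF_C")),
             frozenset(("PF_D", "PF_E"))}
gsp = {frozenset(("PF_A", "PF_B")), frozenset(("PF_B", "PF_C")),
       frozenset(("PF_X", "PF_Y"))}
print(confidence_score(predicted, gsp))
```

In the toy data, the DDI between PF_D and PF_E is ignored because neither domain occurs in the GSP set, so it does not penalize the predicted dataset.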

Table 1 shows the confidence score of each predicted dataset. Although the gap between two scores does
not represent the absolute difference between two datasets, DDIs are clearly more reliable when they
are predicted in higher-confidence datasets. Based on the confidence scores, the most reliable dataset
is ME, followed by TW, DIPD and Top-down. In contrast, RDFF, LLZ, KGIDDI and DIMA-DPROF have low
confidence scores, which means that DDIs predicted in these datasets have a high probability of being
false positives.

Independence score

Figure 1 shows an unsupervised hierarchical clustering of the weighted overlap scores between every pair
of datasets. The result reveals that DPEA and PE predict quite similar DDIs because of their similar
prediction methodologies [26]. More than 95% of the DDIs predicted by DPEA are also found in PE, which
causes the reliability scores of these DDIs to be overestimated.

We therefore considered how independent the datasets that predict the same DDI are from each other when
estimating reliabilities. For every dataset d that contains DDI i, the independence score ID is defined
as:

\[ ID_{d,i} = \frac{1}{1 + Wo_{d,e}} \]

where e denotes all datasets that predict i except d. For example, a dataset whose DDI does not overlap
with any other dataset receives an independence score of one.

Prediction score

The local prediction scores of DDIs assigned by each predicted dataset are also key evidence for
inferring reliabilities. Even when DDIs are found in the same dataset, their reliabilities differ
depending on their prediction scores. We rescaled the original prediction scores of each dataset, which
have different ranges, to the range from 0 to 1 using an ordinal scaling method. Six of the datasets,
namely HiMAP, KGIDDI, LLZ, RDFF, P-value and TW, do not provide their own prediction scores. DDIs
predicted in these datasets receive the average prediction score of the DDIs found in the same number
of datasets.
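
The text does not specify the ordinal scaling method in detail. One plausible reading, sketched below in Python as an assumption rather than the authors' procedure, replaces each raw score by the fraction of DDIs in the same dataset whose raw score is not better than it, which maps every dataset to the interval (0, 1] regardless of its original range. The function name and the `higher_is_better` flag are hypothetical.

```python
from bisect import bisect_right


def ordinal_scale(raw_scores, higher_is_better=True):
    """Rank-based rescaling of one dataset's raw prediction scores to (0, 1].

    raw_scores maps a DDI identifier to its raw score. Tied raw scores
    receive the same scaled score.
    """
    # Flip the sign when small raw values (e.g. p-values) mean better predictions.
    signed = {ddi: (score if higher_is_better else -score)
              for ddi, score in raw_scores.items()}
    ordered = sorted(signed.values())
    n = len(ordered)
    # bisect_right counts how many scores are less than or equal to this one.
    return {ddi: bisect_right(ordered, score) / n
            for ddi, score in signed.items()}


# Toy raw scores on an arbitrary scale, for illustration only.
raw = {"ddi1": 0.2, "ddi2": 5.0, "ddi3": 5.0, "ddi4": 100.0}
print(ordinal_scale(raw))
```

Because only the ordering of the raw scores is used, datasets whose methods report log-odds, probabilities or counts become comparable after scaling, at the cost of discarding the magnitude of score differences.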

Reliability score

Using the confidence scores, independence scores and prediction scores, we calculated a reliability
score for each predicted DDI. For a predicted DDI i, the reliability score R is defined as:

\[ R_i = \sum_{d} C_d \cdot ID_{d,i} \cdot P_{d,i} \]

where d ranges over all datasets that predicted i and $P_{d,i}$ is the prediction score of i assigned
by dataset d.
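
The reliability score can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the IDDI code: the text does not say how $Wo_{d,e}$ is aggregated when several other datasets predict the same DDI, so the sketch sums the pairwise weighted overlaps. The confidence values for ME and PE are taken from Table 1, whereas the pairwise overlap values, the prediction scores and the DDI labels are made up.

```python
def reliability(ddi, predictions, confidence, pairwise_wo):
    """Reliability R_i: sum over supporting datasets d of C_d * ID_{d,i} * P_{d,i}.

    predictions: dataset name -> {DDI: scaled prediction score in [0, 1]}
    confidence:  dataset name -> confidence score C_d
    pairwise_wo: frozenset({d, e}) -> weighted overlap between datasets d and e
    """
    supporting = [d for d, scores in predictions.items() if ddi in scores]
    total = 0.0
    for d in supporting:
        # Assumption: overlap with the other supporting datasets is summed pairwise.
        overlap = sum(pairwise_wo[frozenset((d, e))] for e in supporting if e != d)
        independence = 1.0 / (1.0 + overlap)
        total += confidence[d] * independence * predictions[d][ddi]
    return total


confidence = {"ME": 0.5929, "PE": 0.2348}           # from Table 1
pairwise_wo = {frozenset(("ME", "PE")): 0.5}        # hypothetical value
predictions = {                                      # hypothetical scaled scores
    "ME": {"ddi_x": 0.8},
    "PE": {"ddi_x": 0.5, "ddi_y": 1.0},
}
print(reliability("ddi_x", predictions, confidence, pairwise_wo))
print(reliability("ddi_y", predictions, confidence, pairwise_wo))
```

A DDI supported by a single dataset, such as `ddi_y` here, has an independence score of one, so its reliability reduces to the product of the dataset confidence and the prediction score.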

Integrated analysis system construction 

We constructed a web-based domain-domain interaction analysis interface to provide comprehensive
exploration of DDIs within the protein interaction network. IDDI was built in a Linux environment and
tested for cross-browser compatibility. The system is served by a Tomcat server with an Oracle database,
and the web pages were implemented in Java and JavaServer Pages (JSP). Figure 2 presents the system
architecture of IDDI. IDDI is composed of three components: a database with an update module, an
analysis module and a web user interface. Data used in IDDI are imported from each reference database
into our database, which is semi-automatically updated by the update module. Based on the database, the
system returns search results for a given query, and the analysis of the query is executed by the
analysis module. The web user interface mediates communication between the end user and the analysis
module through user-friendly web pages.

IDDI includes not only our new integrated DDI database but also protein interactions from ComBiCom [28],
so that interactions can be examined in detail at both the domain and protein levels. ComBiCom,
developed in our group, is a database system providing 257,902 non-redundant binary PPIs and 11,964
protein complexes from 9 experimentally identified PPI databases, which cover most of the publicly
available PPI information. To map domains to the proteins that contain them, we used SwissPfam,
available at the Pfam site. It provides SWISS-PROT and TrEMBL proteins with their assigned Pfam domains.
In addition, we stored protein functional annotations obtained from the Gene Ontology to build a
reference set of functional information. An update module is also implemented to update the database
semi-automatically.

IDDI provides four kinds of search services: protein search, domain search, PPI search and DDI search.
The search system uses Pfam IDs and UniProt accession numbers as identifiers for domains and proteins,
respectively. PPI relationships are retrieved from ComBiCom, and protein function information is
annotated from the Gene Ontology. To provide a comprehensive search system, proteins must be mapped to
the domains they contain, and SwissPfam was used for this mapping. Using this mapping data, IDDI can
provide possible DDIs for a protein search or possible PPIs for a DDI search.

Figure 2 Schematic illustration of resource collection, database construction and representation of IDDI

Results 

Data statistics 

Our new DDI database currently contains 204,705 unique DDIs between 7,351 distinct Pfam domains. Among
these DDIs, 6,768 interactions were combined from 3D structure-based datasets and 202,914 interactions
were extracted from predicted datasets. IDDI exceeds currently available comprehensive DDI databases
such as DOMINE and UniDomInt in the number of both interactions and domains (Figure 3). The large
increase in DDIs results from the use of the latest releases of existing datasets and the addition of
new datasets. Several DDI datasets, such as iPfam, 3DID, InterDom and DIMA, contain many more
interactions in their updated releases. The newly introduced datasets, such as PInS, APMM, IPPRI,
Top-down, LLZ and TW, also provide 47,928 DDIs, including 19,609 novel interactions.

Performance evaluation of reliability scoring scheme in 
IDDI 

Unlike DOMINE, both IDDI and UniDomInt have linear scoring schemes. Although both consider the
confidence of the predicted datasets, IDDI additionally reflects the independence of the datasets and
the prediction scores of the DDIs. We evaluated the performance of the datasets and scoring schemes used
in IDDI and UniDomInt with ROC curves (Figure 4). For the sake of fairness, the 3D structure-based DDIs
in IDDI were used as the GSP set for both IDDI and UniDomInt, since all of the 3D structure-based DDIs
in UniDomInt are included in IDDI.
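
For readers who wish to reproduce this kind of evaluation, the following Python sketch shows how ROC points can be derived from reliability scores and a GSP set. It is a generic sketch, not the authors' evaluation code, and it assumes that every predicted DDI absent from the GSP set is treated as a negative, which the text does not state. The function names and toy scores are hypothetical.

```python
def roc_points(scores, positives):
    """Return (false positive rate, true positive rate) pairs for a score sweep.

    scores:    DDI -> reliability score
    positives: set of DDIs regarded as true (here, the GSP set)
    """
    n_pos = sum(1 for ddi in scores if ddi in positives)
    n_neg = len(scores) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative DDI")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][1]
        # Consume all DDIs tied at this cut-off so that ties yield one point.
        while i < len(ranked) and ranked[i][1] == cutoff:
            if ranked[i][0] in positives:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of ROC points sorted by increasing FPR."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Toy scores, for illustration only.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
toy_gsp = {"a", "c"}
curve = roc_points(toy_scores, toy_gsp)
print(curve, area_under_curve(curve))
```

A scoring scheme that ranks more GSP interactions near the top yields a curve closer to the upper-left corner and therefore a larger area.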

Figure 4(a) shows the ROC curves of IDDI and UniDomInt with their own DDI datasets and scoring schemes.
The ROC curves demonstrate that IDDI has a higher true positive rate than UniDomInt at the same false
positive rate. This indicates that IDDI has greater power to filter reliable DDIs. UniDomInt combines
only 8 predicted datasets, and its reliability score is heavily dependent on ME owing to the
overwhelming accuracy of that dataset. This hinders accurate measurement of the reliability scores.
IDDI, on the other hand, includes additional predicted datasets, including TW, DIPD and Top-down, which,
like ME, have high confidence. This prevents the reliability scores from depending excessively on a
single predicted dataset. For example, the interaction between the signal peptide binding domain
(PF02978) and the SRP19 protein (PF01922), a known DDI found in iPfam, is predicted only by the P-value
method among the 13166 predicted DDIs of UniDomInt and has a low reliability score, 0.0548. This score
is ranked in the top 87.3% of all predicted interactions, which means that the DDI has a high
probability of being regarded as a false positive. IDDI, on the other hand, has additional prediction
information for the same DDI from the updated version of InterDom and from DIPD, APMM and Top-down,
which are not included in UniDomInt. IDDI's reliability score for this DDI is ranked in the top 0.42%
of all predicted interactions, indicating a high probability of being a true positive.



Figure 4(b) shows a comparison between the scoring schemes of IDDI and UniDomInt applied to the same
DDI datasets in IDDI. The result reveals that the additional factors in our new scoring scheme are
effective in filtering reliable interactions. UniDomInt considers only the confidence level of the
predicted datasets when assigning a reliability score to each DDI. As a result, DDIs found in the same
dataset cannot be compared because all of them receive the same score. This also causes the reliability
scores to be overestimated: DDIs in a high-confidence dataset are assigned high reliability scores even
if they are more likely to be false positives because of their low prediction scores.

Table 2 Comparison of PPI coverage rates in different DDI databases

                   3D-structure based only    UniDomInt    DOMINE     IDDI
PPI with DDI       25,944                     55,788       60,758     129,922
PPI without DDI    231,958                    202,114      197,144    127,980
Rate (%)           10.0                       21.6         23.6       50.4


We tested the average accuracy for each reliability score cut-off in IDDI. The result reveals that a
cut-off of 0.329 gives the highest accuracy, 0.98. For reference, the cut-off value that gives an
accuracy of 0.90 was 0.102, and 21027 DDIs were included within this cut-off. End users can choose the
cut-off value according to their research purpose; DDIs above a cut-off chosen for high accuracy are
likely to be more reliable.

Comparison of PPI coverage rates 

We compared the PPI coverage rates of the 3D structure-based DDIs, DOMINE, UniDomInt and IDDI using the
binary PPIs in ComBiCom. We define a PPI as covered when at least one DDI is found between the
interacting proteins.
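
The coverage definition above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the IDDI implementation: the protein-to-domain mapping stands in for SwissPfam, DDIs are represented as frozensets of domain identifiers, and all identifiers in the toy data are made up.

```python
def ppi_coverage(ppis, protein_domains, ddis):
    """Count PPIs covered by at least one DDI between the two partners.

    ppis:            list of (protein, protein) pairs
    protein_domains: protein -> set of domains it contains
    ddis:            set of frozensets, each holding the domains of one DDI
    Returns (covered, not_covered, coverage rate in percent).
    """
    covered = 0
    for protein_p, protein_q in ppis:
        domains_p = protein_domains.get(protein_p, ())
        domains_q = protein_domains.get(protein_q, ())
        # A PPI is covered as soon as one domain pair is a known DDI.
        if any(frozenset((a, b)) in ddis for a in domains_p for b in domains_q):
            covered += 1
    rate = 100.0 * covered / len(ppis) if ppis else 0.0
    return covered, len(ppis) - covered, rate


# Toy data with made-up identifiers, for illustration only.
toy_ppis = [("P1", "P2"), ("P1", "P3"), ("P2", "P3"), ("P3", "P4")]
toy_domains = {"P1": {"dA"}, "P2": {"dB", "dC"}, "P3": {"dD"}}
toy_ddis = {frozenset(("dA", "dB")), frozenset(("dC", "dD"))}
print(ppi_coverage(toy_ppis, toy_domains, toy_ddis))
```

Proteins without any assigned domain, such as P4 in the toy data, can never be covered, which is one reason why coverage stays below 100% even with a large DDI collection.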

Table 2 shows the number of covered PPIs, the number of non-covered PPIs and the PPI coverage rate for
each DDI dataset. The 3D structure-based DDIs cover only 10.0% of PPIs. IDDI, on the other hand, covers
50.4% of PPIs, more than twice the coverage rate of DOMINE.

Functionality of the integrated interaction analysis system 

IDDI was constructed to provide comprehensive searches on proteins or domains and to give insight into
detailed interactions. We emphasize easy access to information on related proteins, domains and
complexes. The system includes protein, domain, PPI and DDI search functions (see Figure 5). First, in
a protein search, users can check information on the query protein, such as its function from the Gene
Ontology, the domains it contains from SwissPfam, and the known binary PPIs and complexes related to
the protein from ComBiCom (see Figure 5(a) and 5(b)). Users can easily access detailed information about
each listed domain or PPI to investigate the working mechanism of the protein. Second, in a domain
search, the DDIs related to the query domain and the proteins that contain the domain are provided (see
Figure 5(c)). Third, in a PPI search, users can query two proteins to check whether they are known to
interact. The system also analyzes the domains of each protein and predicts the possible DDIs between
the two proteins (see Figure 5(d)). Last, in a DDI search, given a query of two domains, IDDI checks
whether the two domains have a DDI relationship and predicts the possible PPIs induced by the DDI.

Figure 5 Example for IDDI functionalities (a) Protein interaction partners of P53 (P04637) having DDI
relationship with the P53 transactivation domain (PF08563) (b) Complex information containing P53 and
MDM2 (Q00987) (c) Domain interaction partners of P53 transactivation domain (d) DDI information between
P53 and Necdin (Q99608).

Example of integrated interaction analysis 

IDDI provides a comprehensive search service for exploring the relationships of proteins and domains.
It can be used to select genes for study by prioritizing a list of proteins with the filtering function.
In this section, we provide an example analysis of the interaction targets of p53. Figure 5(a) and 5(d)
illustrate the integrated analysis of specific PPIs and DDIs of the p53 protein. Interacting partners of
p53 can be found using the protein search, and the list of interacting partners is subdivided by domain
interaction. Among them, those that have a domain interaction with the transactivation domain of p53
can be selected using the filtering option (Figure 5(a); only part of the list is shown). With this
restriction, a total of 11 interacting partners were selected from the 355 partners of p53. A specific
DDI can be investigated further through the "DDI" link, as shown in Figure 5(d). In this example, the
summary shows that the MAGE domain of Necdin interacts with the transactivation domain of p53. Indeed,
the mechanism by which the two domains interact for the function of the two proteins has been
established by detailed experimental work [29]. The investigation can be extended to the other selected
proteins or by tracing other proteins that have a MAGE domain using our system. As this example shows,
our system enables more sophisticated and efficient investigation of protein interactions and their
functions by providing an integrated analysis scheme for DDIs and PPIs.

Conclusions 

We proposed a new unified interaction analysis system, IDDI, which enables the comprehensive analysis
of protein and domain interactions and their interconnectivity. The large increase in total DDIs
enables high interconnectivity between DDIs and PPIs, and an advanced scoring scheme substantially
enhances the reliability of the integrated DDIs. Furthermore, IDDI provides a convenient interface for
investigating protein interactions together with detailed domain interactions. IDDI will be a valuable
resource for in-depth study of interaction mechanisms and thereby for deriving the functional
implications of interacting proteins.